Abstract


Self-driving laboratories (SDLs) represent a cutting-edge concept in scientific 
research and experimentation. SDLs utilize automated instruments, 
recommendation algorithms, and an orchestration device to conduct experiments 
and analyze data without human intervention. Among the array of experiments 
conducted by SDLs, cyclic voltammetry (CV) and differential pulse voltammetry 
(DPV) are prominent, offering insights into electrochemical processes. However, 
efficiently extracting crucial information, such as overall shape and peaks, from CV 
and DPV data remains challenging. This thesis presents a novel encoding technique 
tailored for CV and DPV data to enhance SDLs' understanding of chemical 
environments. With this encoding method, SDLs can discern intricate patterns and 
relationships within the data more effectively. Experiments consisting of various 
machine learning tasks, such as clustering, classification, denoising, and synthetic 
data generation, that an SDL may encounter showed excellent results. Beyond 
SDLs, the utility of this encoding technique extends to any 2-dimensional data. Its 
versatility opens avenues for broader scientific and industrial applications, 
empowering researchers and practitioners to glean valuable insights from complex 
datasets. As SDLs continue to evolve, incorporating innovative methodologies such 
as this encoding technique promises to accelerate scientific discovery and advance 
technological frontiers. 
Chapter 1


Introduction 


Amidst urgent global challenges like climate change, energy sustainability, and health- 
care crises, there is a growing need for efficient solutions to address the needs of a 
growing population and increasing resource demands. Accelerating advancements 
in materials, technology, and scientific knowledge offer potential avenues for tack- 
ling these challenges. However, conventional research methods, marked by gradual 
progress and limited efficiency, may fall short of meeting the urgency posed by these 
issues. Self-driving laboratories (SDLs), which integrate laboratory automation and 
data-driven decision-making, emerge as promising tools to expedite and streamline 
the exploration of solutions while presenting several advantages over traditional scientific
approaches [1]. Developing a fully autonomous self-driving laboratory is a
complex endeavour that combines various research disciplines. Machine learning and 
modelling techniques are utilized to forecast materials properties and propose new ex- 
periments. SDLs typically use optimization techniques to guide their decision-making 
algorithm. An example of this is Atlas, a Python library offering access to different 
optimization algorithms, which has been used to identify the voltage peak in CV 
experiments to optimize the oxidation potential of a set of metal complexes [2]. Con- 
currently, robotics, computer vision, and automated characterization methods are 
employed to conduct experiments and analyze outcomes. Integrating these disparate 
technologies into a cohesive platform is central to the design of autonomous labs, 


facilitating seamless interaction between experiments and computational modelling
[3].


SDLs can conduct experiments autonomously, performing tasks quicker and more pre- 
cisely than manual processes. Moreover, they utilize data-driven algorithms to nav- 
igate through experimental spaces, enabling efficient exploration based on feedback 
from existing data, a process known as “closed-loop” experimentation. Additionally, 
SDLs address issues such as reproducibility challenges and the underrepresentation 
of negative results in scientific literature by promoting the digitization of research 
processes. Through automated systems, experimental protocols are meticulously doc- 
umented, enhancing repeatability and reproducibility. Furthermore, digitization fa- 
cilitates comprehensive data recording and sharing, emphasizing the importance of 
negative or null results, thus providing a more accurate depiction of scientific endeav- 
ours. The wealth of high-quality data generated by autonomous experimentation 
is valuable for developing artificial intelligence (AI) in materials science and chem- 
istry. By improving machine learning (ML) and deep learning (DL) models, this data 
enhances the decision-making capabilities of SDLs, furthering their effectiveness in 


optimizing materials or processes and facilitating novel discoveries [1]. 


SDLs in chemistry and materials science are characterized by two critical dimen- 
sions: software autonomy and hardware autonomy. Regarding software autonomy, 
which governs experiment selection, SDLs are categorized into three types: (1) sin- 
gle iterations of automated experimentation with data-driven methods for selecting 
subsequent experiments, (2) multiple iterations within closed-loop systems where ex- 
perimental results inform subsequent rounds of automated experiments, and (3) gen- 


erative approaches involving numerous iterations of closed-loop optimization within 
algorithmically generated search or chemical spaces. By automating high-throughput
experimentation and streamlining experiment planning and execution, SDLs can sub- 
stantially accelerate research in chemistry and materials discovery. SDLs have played 
a pivotal role and made noteworthy advancements in various fields, including drug 
discovery, genomics, chemistry, and materials science [1]. Since SDLs solve many 
problems, they should be widely adopted to accelerate chemical research. However, 
the accessibility of SDLs is impeded by their complexity and the substantial finan- 
cial investment required for high-precision commercial platforms. Thus, there is a 


pressing demand for affordable alternatives to democratize SDLs [4]. 


Chapter 2 


Background and Motivation 


2.1 Electrochemistry 


Given the pivotal role of reduction-oxidation (redox) reactions in materials chemistry 
and industrial applications, electrochemistry stands as a primary beneficiary of ad- 
vancements in SDLs. According to the broad definition commonly accepted among 
researchers, electrochemistry encompasses the study of both the physical and chem- 
ical characteristics of ionic conductors, along with phenomena taking place at the 
interfaces between these ionic conductors and electronic conductors, semiconductors, 
other ionic conductors, and even insulating materials (such as gases and vacuum) [5]. 
The flow of electrons only occurs between two species, but the transfer of charge can 
also occur through an oxidation-reduction reaction. When a substance loses an elec- 
tron, its oxidation state increases, indicating oxidation. When a substance acquires 
an electron, its oxidation state decreases, indicating reduction. For example, consider 


the following redox reaction, which has oxidation and reduction components: 


$\mathrm{H_2} \rightarrow 2\,\mathrm{H^+} + 2\,e^-$    Oxidation (2.2)
$\mathrm{F_2} + 2\,e^- \rightarrow 2\,\mathrm{F^-}$    Reduction (2.3)
A redox reaction is balanced when the number of electrons gained by the oxidant
equals the number of electrons lost by the reductant. Like any balanced chemical 
equation, the entire process is electrically neutral, meaning that the net charge re- 
mains consistent on both sides of the equation. With redox reactions, it is possible 
to separate the oxidation (2.2) and reduction (2.3) half-reactions physically in space,
provided a complete circuit exists using an external electrical link, such as a wire, 
connecting the two halves. Electrons migrate from the reductant to the oxidant as the 


reaction progresses through this electrical connection, generating an electric current. 


Electrochemical cells are devices that use redox reactions to generate electricity or 
use electricity to drive non-spontaneous redox reactions. This device effectively trans- 
forms chemical energy into electrical energy or vice-versa. In an electrochemical cell, 
reduction and oxidation reactions occur at the electrodes. The electrode where reduc- 
tion occurs is termed the cathode, while oxidation occurs at the anode. An electrode 
serves as a stable electrical conductor, facilitating the flow of electrical current within 
non-metallic solids, liquids, gases, plasmas, or even vacuums. Electrodes are typically 
fabricated from highly conductive materials, including but not limited to metals and 
graphite [6]. In a battery, redox reactions create a flow of electrical current that can 


be used to power electronic devices. 


Electrode potential is the voltage of an electrochemical cell composed of a refer- 
ence electrode and another electrode to be characterized. Figure 2.1 shows a three- 
electrode setup typical for electrochemical experiments such as cyclic voltammetry. 
During the flow of current between the working and counter electrodes, the reference 


electrode is used to precisely measure the applied potential in relation to a stable 
Figure 2.1: Schematic of Electrochemical Cell [7]


reference reaction. A potentiostat, as shown in Figure 2.2, is an analytical instru- 
ment designed to control the potential between the working electrode and counter 
electrode within a multi-electrode cell [8]. The potentiostat contains various inter- 
nal circuits tailored to fulfil this role, facilitating the generation and measurement of 
potentials and currents. External wires within a cell cable establish connections be- 
tween the potentiostat circuit and the electrodes within the electrochemical cell. In a 
three-electrode configuration, the cell cable links the working, counter, and reference 
electrodes on one terminal and the potentiostat cell cable connector on the opposite 


end. The potentiostat's internal circuitry governs the applied signal. 
Figure 2.2: Potentiostat Circuit Diagram


The working electrode performs the electrochemical event of interest. Since reactions 
occur at the cathode and anode surfaces, it is crucial that the surface is spotless and 
that the surface area is well-defined. The working electrodes should be immediately 
polished after use to ensure there are no surface contaminants that inhibit electron 
transfer. Even a few hours of air exposure will degrade the electrode surface. Detect- 
ing when surface contamination affects data quality is one of the questions this work 
addresses. This detection can trigger automatic polishing or replacement with a new 


disposable electrode [9]. 


Commercial vendors commonly provide potentiostats that are governed by proprietary
software, employ graphical user interfaces (GUI), and produce already curated
data. These devices are widely used for electroanalytical experiments such as cyclic 
voltammetry and differential pulse voltammetry. Commercial potentiostats can vary 
in design, but a typical potentiostat is shown in Figure 2.2 and consists of three com- 
ponent circuits: a control circuit, an electrometer, and a current follower [10]. The 
electrometer circuit utilizes a differential amplifier to measure the difference in poten- 
tial between the working and reference electrodes. Subsequently, the measured po- 
tential feeds into the control circuit, which administers a current through the counter 
electrode, altering the relative potential of the working electrode to align with the 
user-defined parameters. A signal generator ensures this potential adheres to a predefined
periodic waveform. The current flowing through the working electrode is then
assessed by a current follower circuit, commonly in the form of a current-to-voltage 
converter. This circuit measures the drop in potential across a grounded resistor,


allowing the current to be determined using Ohm's law. 


However, commercial potentiostats present challenges for integration into automated 
systems due to their reliance on proprietary software and GUIs. Furthermore, their 
high cost poses a significant barrier for groups seeking to perform high-throughput 
analysis. To address these issues, Pablo-García et al. recently introduced an open- 
source, low-cost potentiostat [11]. Coupled with the synthesis platform reported in 
the same work, this innovative device aims to democratize electrochemical analysis 
by reducing the financial barrier to entry and improving integration with automa- 
tion systems. By providing an affordable and accessible alternative to traditional 


commercial potentiostats, this open-source solution empowers researchers to conduct 
electrochemical experiments with greater flexibility and efficiency. Despite its remarkable
advancements, the device's precision falls short of commercial standards.
As such, we later explore various machine-learning methodologies to enhance data 


quality. 


2.2 Cyclic Voltammetry 


Figure 2.3: Cyclic Voltammogram
A typical duck-shape is shown with points of interest labelled (normalized current plotted against normalized potential).


Cyclic voltammetry (CV) is a common electrochemical characterization technique 
that extracts essential redox information about molecules [12]. Typically, the working
electrode potential increases linearly with time. After reaching a predetermined limit, 
the potential decreases to return to the starting voltage. These cycles can be repeated
as frequently as needed to bolster confidence in the obtained data. The rate of voltage
change over time is known as the experiment's scan rate (voltage/time) and affects


how many data points are gathered throughout the experiment [13]. 


CV is valuable for studying qualitative information about electrochemical processes 
across diverse conditions. It enables the examination of intermediates in oxidation- 
reduction reactions and the assessment of reaction reversibility. Other use cases 
include the determination of electron stoichiometry, analyte diffusion coefficients, and 
formal reduction potentials, aiding in identification processes [14]. Additionally, in
reversible Nernstian systems, the proportional relationship between concentration 
and current allows for determining unknown solution concentrations by constructing 


calibration curves correlating current and concentration [15]. 


In a typical cyclic voltammogram shown in Figure 2.3, peaks represent electrochemical
processes occurring at the electrode surface. The anodic peak ($E_{pa}$) is observed
during the scan where oxidation of the electroactive species occurs at the working electrode
and corresponds to the potential at which oxidation is most favourable. The current
increases as the potential applied to the electrode becomes more positive, reaching
a maximum at the peak potential. The cathodic peak ($E_{pc}$) is observed during
the reverse scan where reduction of the electroactive species occurs at the working
electrode and corresponds to the potential at which reduction is most favourable.
The magnitude of the current increases as the potential becomes more negative, reaching a maximum
at the peak potential [16]. Typically, chemists are especially interested in these peaks


as they condense the redox behaviour of the analyzed compound [17]. 


2.3 Differential Pulse Voltammetry 


Figure 2.4: Differential Pulse Voltammogram
(A) shows increasing pulses while (B) shows the resulting voltammogram.


Differential Pulse Voltammetry (DPV) is a more sophisticated electrochemical measurement
technique where a series of increasing pulses are applied across the electrodes
in an electrochemical cell [18]. The current $I_1$ is measured right before applying the
pulse at time $t_1$, and $I_2$ is measured again at the end at time $t_2$. The difference in
current, $\Delta I = I_2 - I_1$, is plotted against the potential and results in a peak-like
shape. This method helps reduce the impact of charging current by sampling the 
current just before the potential change. DPV is well suited for measurements with 
extremely low concentrations of chemicals. This is because the effect of the charging 
current can be minimized to achieve high sensitivity, and only the faradaic current, 


the electric current generated by the redox of a chemical at an electrode, is extracted 
so that electrode reactions can be measured precisely [19]. Furthermore, DPV is a
versatile tool for qualitatively characterizing chemical compounds and their electro- 
chemical properties [18]. By analyzing the shape, position, and area of the peaks in 
the DPV curve, chemists can glean insights into the nature of the electroactive species 
present, their concentration, kinetics of electron transfer processes, and other relevant 
electrochemical parameters. This capability makes DPV invaluable in various fields, 
such as analytical chemistry, environmental monitoring, and pharmaceutical research, 
where understanding the behaviour of chemical compounds at the molecular level is 


crucial [18]. 
Chapter 3


Clustering 


3.1 Introduction 


Given the capabilities and limitations of SDLs, quick and accurate characterization 
of the electrical compounds produced is needed. Clustering experimental results be- 
comes crucial for several reasons. Clustering identifies patterns and similarities among 
experimental results. This aids in the discovery of underlying trends or relationships 
between different compounds or experimental conditions. It also facilitates quality 
control by pinpointing outliers or anomalies in experimental data, ensuring the reli- 


ability of SDL data. 


Moreover, clustering allows researchers to optimize processes by providing insights 
into the effects of various parameters, such as metal/ligand ratio, on the formation of 
redox or electrochemical compounds. This optimization can significantly enhance the 
efficiency of voltammetry automation in SDLs. Additionally, by classifying different 
types of electrical compounds based on their properties or characteristics, clustering 
supports classification and prediction tasks, enabling researchers and SDLs to predict 
the behaviour of new compounds or classify unknown compounds based on their sim- 
ilarities to known clusters. Categorizing the automated characterizations typically 
done by SDLs can improve the automation capabilities of the laboratory. This is 


done by improving the behaviour mapping of a compound under certain conditions 
into a chemical space. Finally, clustering provides a structured way to organize and
interpret large volumes of experimental data, facilitating decision-making processes 
related to the selection of compounds for further analysis or the design of future ex- 
periments. In essence, clustering experimental results in the context of SDLs used for 
voltammetry is indispensable for gaining insights, ensuring data quality, optimizing 
processes, classifying compounds, and facilitating decision-making processes, which 


are crucial in an SDL environment. 


3.2 Data Collection 


To analyze how data gathered from SDLs can be clustered, this work uses an open 
dataset published by the Aspuru-Guzik group [11]. The data was collected through 
autonomous electrochemistry experimentation that operates through an iterative 
workflow [11]. The workflow was used to synthesize complexes from 10 distinct metals and 10 distinct
ligands, with specific details available in Appendix A.1 and Appendix A.2, resulting
in 100 unique complexes. Each complex was synthesized using a metal/ligand concentration
ratio of 1:7 to ensure complete complexation. The synthesis process employed
1.0 M NaCl in water as the electrolyte/solvent and a buffer solution consisting of a 
1:1 ratio of HOAc/NaOAc. Following synthesis, comprehensive characterizations were 
conducted using CV and DPV techniques. The experimentation was done using a 
low-cost electrochemistry platform designed as an alternative to commercial options. 
The number of points in each sample can vary due to different scan rates or voltage 


windows. Higher scan rates lead to more data points being collected during the experiment
and can provide finer resolution of the electrochemical processes occurring.
Additionally, it's worth noting that these samples may be duplicated as CV and DPV 
analyses can be conducted multiple times on the same sample to ensure robustness. 
Notably, the workflow is adaptable, with the potential to encompass a broader range 
of parameters, including additional ligands, varying metal/ligand ratios, and reaction 
times. Mixed ligands and different buffer pH levels can also be configured but were not
used for this dataset. The accumulation of data points is ongoing, contributing
to the continuous expansion and refinement of our understanding. The final dataset 
consists of 800 CV and 200 DPV data points. The dataset used in this work can be 


found on Zenodo. 


3.3 Curse of Dimensionality 


The curse of dimensionality refers to the phenomena that cause various challenges 
and complications when analyzing data in high-dimensional spaces. As the number 
of features in a dataset expands, the volume of data required to generalize accurately 
grows exponentially [20]. With each additional dimension, the data becomes sparser,
posing significant hurdles for tasks like clustering and classification. The distinction
between distances among data points diminishes in higher dimensions, making
measurements like Euclidean distance less informative. As such, algorithms that rely on
distance measurements will experience a drop in performance. Furthermore, more
dimensions will require more computational resources and time to process the data. 


It is good practice to aim to have the data in as low a dimension as possible, provided 
that relevant information is maintained.
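The loss of contrast between distances can be illustrated with a small numerical experiment. The sketch below is not part of the pipeline used in this work: the function name `relative_contrast`, the uniform random data, and the chosen dimensions are assumptions made purely for illustration. It measures how much the farthest point differs from the nearest point, relative to the nearest distance.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(max - min) / min of the distances from one query point to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))   # Euclidean distance
    return (max(dists) - min(dists)) / min(dists)

# Print the contrast for a few dimensionalities (illustrative values only).
for d in (2, 10, 100, 1000):
    print(d, round(relative_contrast(d), 3))
```

Under these assumptions the contrast is expected to shrink as the dimension grows, which is the effect that degrades distance-based algorithms.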


3.4 Ramer-Douglas-Peucker Algorithm 


Figure 3.1: RDP Algorithm
The effects of the RDP algorithm are shown, with the original curve on the left and
the simplified curve (epsilon = 0.2) on the right.


The Ramer-Douglas-Peucker (RDP) algorithm reduces the number of points in a 
curve approximated by a series of points [21]. It operates by conceptualizing a line 
between the initial and terminal points within the curve's points. Subsequently, it 
identifies the point furthest from this line among the intermediary points. If this
point, termed the outlier point, lies within a specified distance $\epsilon$ of the line, then so do
all intermediary points, and they are all removed. Conversely, if the outlier point lies
farther than $\epsilon$ from the line, the curve is segmented into two parts: from the initial
point to the outlier point (inclusive), and from the outlier point to the terminal
point. The algorithm is then recursively applied to both resulting segments, and the
reduced forms of the curve are reassembled. Figure 3.1 shows curve simplification 
done with the RDP algorithm. Since CV and DPV results can be represented as a 
curve, RDP can be used to remove unnecessary points while maintaining the overall 
shape of the voltammogram. This will reduce the dimensionality and improve data 


analysis results. 
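The recursion described above can be sketched in a few lines. This is a minimal illustration and not the implementation used in this work; the function names `perpendicular_distance` and `rdp` and the example curve are assumptions introduced here.

```python
import math

def perpendicular_distance(p, a, b):
    """Distance from point p to the line through a and b (2-D tuples)."""
    (x, y), (x1, y1), (x2, y2) = p, a, b
    dx, dy = x2 - x1, y2 - y1
    length = math.hypot(dx, dy)
    if length == 0.0:                        # a and b coincide
        return math.hypot(x - x1, y - y1)
    return abs(dy * (x - x1) - dx * (y - y1)) / length

def rdp(points, epsilon):
    """Simplify a polyline; intermediate points within epsilon of the chord are dropped."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    # Find the intermediate point farthest from the first-to-last line.
    index, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = perpendicular_distance(points[i], first, last)
        if d > dmax:
            index, dmax = i, d
    if dmax <= epsilon:
        return [first, last]                 # every intermediate point is removed
    left = rdp(points[:index + 1], epsilon)  # initial point .. outlier (inclusive)
    right = rdp(points[index:], epsilon)     # outlier .. terminal point
    return left[:-1] + right                 # the outlier is shared, keep it once

curve = [(0, 0), (1, 0), (2, 0), (3, 4), (4, 0), (5, 0), (6, 0)]
simplified = rdp(curve, epsilon=0.5)
```

In this example the collinear points at x = 1 and x = 5 are dropped, while the spike at (3, 4) and its neighbours are kept, so the overall shape survives the reduction.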


3.5 Data Preparation and Encoding 


Many parameters can be set during CV and DPV analysis, affecting the characteri- 
zation outcomes. Notably, the experiment's scan rate affects the sampling frequency 
and the number of points collected within a specific time interval, leading to a vari- 
able number of point densities depending on the analyzed compound. Heterogeneity 
among samples becomes challenging for many ML algorithms, as they often require 
input data to be the same shape. Similarly, the potential limit at which the po- 
tential begins to return to its initial point will affect the overall shape of the cyclic 
voltammogram. To address these issues, the following steps are used to encode the


data: 


1. Split experiment cycles into separate data points. 
2. Normalize values to fit between [0, 1]. 
3. Reduce points using the Ramer-Douglas-Peucker algorithm. 


4. Duplicate data points until the total length reaches the longest cycle's length. 
5. Order data points based on angular position relative to the center.


Due to the curse of dimensionality, the RDP algorithm is used to reduce the number
of dimensions. Since the RDP algorithm takes only a tolerance parameter $\epsilon$, the final length
after reduction will differ for each data set. Data points are randomly selected and
duplicated to ensure the data is the same size as the longest cycle after RDP reduction.
Finally, the data is ordered based on its angular position relative to the center for
consistency. Plots of the raw and processed data can be seen in Figure 3.2, with
the colour of the scatter plot representing the time sequence of the points. The starting
point and end point vary across different voltammetry experiments. As such, it
is essential to order the data points so that comparisons can be informative. A
significant reduction in dimensionality by 1/3 can also be seen in the plots. Despite
this, the critical characteristics of the voltammogram, such as the overall shape and
peaks, are maintained.

Figure 3.2: Raw Data and Encoded Data
Left panel: raw data (current against potential). Right panel: processed data (normalized current against normalized potential).
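The encoding steps listed in this section can be sketched for a single cycle as follows. This is an illustrative sketch only: the function names, the choice of (0.5, 0.5) as the center, the optional `reduce` hook standing in for the RDP step, and the flattening into one feature vector are all assumptions, since the text does not fix these details. Splitting an experiment into cycles (step 1) is assumed to have been done beforehand.

```python
import math
import random

def normalize(values):
    """Min-max scale a list of numbers to [0, 1] (step 2)."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0                 # guard against constant input
    return [(v - lo) / span for v in values]

def pad_by_duplication(points, target_len, rng):
    """Randomly duplicate existing points until target_len is reached (step 4)."""
    padded = list(points)
    while len(padded) < target_len:
        padded.append(rng.choice(points))
    return padded

def order_by_angle(points, center=(0.5, 0.5)):
    """Sort points by angle around `center` (step 5); the center is an assumption."""
    cx, cy = center
    return sorted(points, key=lambda p: math.atan2(p[1] - cy, p[0] - cx))

def encode_cycle(potential, current, target_len, reduce=None, seed=0):
    """Encode one cycle as a flat, fixed-length feature vector."""
    points = list(zip(normalize(potential), normalize(current)))
    if reduce is not None:
        points = reduce(points)             # step 3, e.g. an RDP function
    points = pad_by_duplication(points, target_len, random.Random(seed))
    ordered = order_by_angle(points)
    return [coord for point in ordered for coord in point]

vec = encode_cycle([0.0, 0.5, 1.0, 0.5], [0.0, 1.0, 0.0, -1.0], target_len=6)
```

Because every cycle is padded to the same `target_len` and ordered the same way, the resulting vectors have equal length and comparable positions, which is what most ML algorithms require of their input.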


3.6 K-Means 


Figure 3.3: K-Means Clustering Visualization with CV Data
Due to the dimensionality of the data, UMAP (see subsection 3.9) was applied to
reduce the dimensions of the data. Axes: UMAP Dimension 1 and UMAP Dimension 2.


K-Means clustering is an unsupervised machine learning algorithm aimed at dividing 
data points into clusters so that the data points within each cluster are similar and 


different from those in other clusters [22]. The K-Means clustering result for CV data 




can be seen in Figure 3.3. The algorithm is explained below, with K representing the 


desired number of clusters: 


1. Initially, K points are selected randomly as the cluster centroids.
2. Each data point is assigned to the closest centroid, quantified by the Euclidean distance.
3. Each cluster centroid is updated to reflect the average of data points currently assigned to that cluster.
4. Steps 2 and 3 are repeated for a specified number of iterations.
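The four steps above can be written out directly. The following is a minimal sketch using only the Python standard library, not the implementation used for the results in this work; the function name `kmeans`, the early-stopping check, and the toy data are assumptions added for illustration.

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    """Plain K-Means on a list of equal-length tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # step 1: random initial centroids
    labels = [0] * len(points)
    for _ in range(iterations):                  # step 4: repeat steps 2 and 3
        # Step 2: assign each point to the nearest centroid (Euclidean distance).
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Step 3: move each centroid to the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(centroids[c])   # keep an empty cluster's centroid
        if new_centroids == centroids:           # assumed extra: stop once stable
            break
        centroids = new_centroids
    return centroids, labels

data = [(0, 0), (0, 1), (1, 0), (100, 100), (100, 101), (101, 100)]
centroids, labels = kmeans(data, k=2)
```

For two well-separated groups such as these, the first three points end up in one cluster and the last three in the other; which cluster receives index 0 depends on the random initialization.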


One of the questions that needs to be answered is the choice of K. This means finding 
a balance between the number of clusters represented by K and the average variance 
of the clusters while minimizing both. There is no approach for determining K that 
works better than all others. For this problem of clustering CV and DPV data, the 
Elbow Method [23] and the Silhouette method are used [24]. The Elbow Method is 
performed by plotting the within-cluster sum of squares (WCSS) for a range of K and 
choosing the value K where adding more clusters does not significantly decrease the 
WCSS. While the Elbow Method can quickly eliminate many values of K, it also has 
drawbacks regarding the shape of the WCSS curve. Determining the exact location of
the “elbow” can be subjective and depends on the analyst's interpretation. Different
individuals may identify different elbows, leading to inconsistency in results. In cases 
where the relationship between the number of clusters and WCSS is not distinctly 
elbow-shaped, the Elbow Method may not provide clear guidance for choosing the 


appropriate number of clusters. 
The Silhouette Method addresses some of these drawbacks by providing a more quantitative
measure of cluster quality. Instead of relying on subjective interpretation, the
Silhouette Method calculates the silhouette coefficient for each data point, quantifying
how similar an object is to its own cluster compared to other clusters. This provides a more
objective measure of cluster cohesion and separation. The process for selecting K for
this work includes determining a set of candidate K values using the Elbow Method by 
eliminating suboptimal values and then using the Silhouette method to find optimal 


K among the potential candidates. 
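Both quantities used for choosing K can be computed from a set of cluster labels. The sketch below is a hedged illustration rather than the code used in this work: the function names `wcss` and `silhouette_score`, the convention of scoring singleton clusters as 0, and the toy data are assumptions. In practice the labels would come from running K-Means for each candidate K.

```python
import math

def wcss(points, labels):
    """Within-cluster sum of squared distances to each cluster mean (Elbow Method)."""
    total = 0.0
    for c in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == c]
        mean = [sum(x) / len(members) for x in zip(*members)]
        total += sum(math.dist(p, mean) ** 2 for p in members)
    return total

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points; needs at least two clusters."""
    clusters = {c: [p for p, lab in zip(points, labels) if lab == c]
                for c in set(labels)}
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)               # assumed convention for singletons
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for c, other in clusters.items() if c != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = [0, 0, 0, 1, 1, 1]    # labels matching the two visible groups
mixed = [0, 1, 0, 1, 0, 1]   # labels that ignore the groups
```

A labelling that matches the structure of the data gives a lower WCSS and a silhouette score closer to 1 than a labelling that mixes the groups, which is the basis for comparing candidate values of K.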


3.7 Density-Based Spatial Clustering of Applications with Noise


Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is another
clustering algorithm that works by partitioning the data into dense regions of points
separated by less dense areas [25]. It defines clusters as areas of the dataset with many
points close to each other, while the points far from any cluster are considered outliers
or noise. In DBSCAN, $\epsilon$ represents the maximum distance between two points for
them to be considered neighbours, and minimum samples represents the number of
neighbours within $\epsilon$ required for a point to be considered a core point. Points that are
neither core points nor within $\epsilon$ of a core point are labelled as noise. The key differentiator for DBSCAN
is that the number of clusters does not need to be determined beforehand.
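The roles of the two parameters can be seen in a compact sketch of the algorithm. This is an illustration under stated assumptions, not the implementation used in this work: the function name `dbscan` is introduced here, a point is counted among its own neighbours (a common convention), and the quadratic neighbour search is only suitable for small datasets.

```python
import math

def dbscan(points, eps, min_samples):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    n = len(points)
    # Neighbourhood of each point; the point itself is included in the count.
    neighbours = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
                  for i in range(n)]
    core = [len(nb) >= min_samples for nb in neighbours]
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue
        # Grow a new cluster outward from an unassigned core point.
        labels[i] = cluster
        frontier = [i]
        while frontier:
            current = frontier.pop()
            for j in neighbours[current]:
                if labels[j] == -1:
                    labels[j] = cluster
                    if core[j]:              # only core points keep expanding
                        frontier.append(j)
        cluster += 1
    return labels

data = [(0, 0), (0, 1), (1, 0), (1, 1),
        (10, 10), (10, 11), (11, 10), (11, 11),
        (50, 50)]
labels = dbscan(data, eps=1.5, min_samples=3)
```

Here two clusters are found without specifying their number, and the isolated point at (50, 50) is labelled as noise.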
Figure 3.4: DBSCAN Clustering Visualization with CV Data
UMAP (see subsection 3.9) was applied to reduce the dimensions of the data.


3.8 t-Distributed Stochastic Neighbour Embedding 


Dimensionality reduction techniques like t-Distributed Stochastic Neighbour Embedding (t-SNE)
are used for visualizing high-dimensional data in a low-dimensional space [26].
This visualization can aid in the clustering process by providing insights into the
underlying structure of the data and help in understanding the results of the clustering
algorithm. The first step of the algorithm is to create a probability distribution that
represents the similarity between neighbours. The similarity between two data
points is derived from their Euclidean distance. Each data point is placed at the
centre of a Gaussian curve, and the rest of the data is placed along the curve. This
is represented by the following equation, where $j \neq i$ and $p_{i|i} = 0$:

$$p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)} \qquad (3.1)$$

“The similarity of datapoint $x_j$ to datapoint $x_i$ is the conditional probability, $p_{j|i}$, that
$x_i$ would pick $x_j$ if neighbours were picked in proportion to their probability density
under a Gaussian centred at $x_i$” [26]. The remaining variable, $\sigma_i$, is not chosen
directly but rather by selecting a value for perplexity. Perplexity is defined as:

$$\mathrm{Perp}(p) := 2^{-\sum_{x} p(x) \log_2 p(x)} \qquad (3.2)$$


Perplexity can be interpreted as a smooth measure of the effective number of neighbours that each point should have, with higher values corresponding to higher variance. After choosing the perplexity value, the corresponding sigma values are found using binary search. Next, the similarities between data points in the low-dimensional representation must also be found; the projection is then adjusted by minimizing the Kullback-Leibler divergence between the two sets of similarities, which ensures that similar data are close after projection [26].
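The binary search for sigma can be sketched as follows. The helper names, the search bounds, and the random test data are assumptions for illustration; library implementations of t-SNE differ in detail.

```python
import numpy as np

def conditional_probs(sq_dists, sigma):
    """p_{j|i} for one point i, given squared distances to all other points."""
    logits = -sq_dists / (2.0 * sigma ** 2)
    logits -= logits.max()  # subtract the maximum for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def perplexity(p):
    """2 raised to the Shannon entropy (base 2) of a discrete distribution."""
    nz = p[p > 0]
    return 2.0 ** (-(nz * np.log2(nz)).sum())

def sigma_for_perplexity(sq_dists, target, lo=1e-4, hi=1e4, iters=100):
    """Bisection on sigma; perplexity increases monotonically with sigma."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if perplexity(conditional_probs(sq_dists, mid)) > target:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# Hypothetical data: 50 random points in 5 dimensions, sigma found for point 0.
rng = np.random.default_rng(0)
points = rng.normal(size=(50, 5))
i = 0
sq_dists = np.sum((np.delete(points, i, axis=0) - points[i]) ** 2, axis=1)
sigma_i = sigma_for_perplexity(sq_dists, target=10.0)
achieved = perplexity(conditional_probs(sq_dists, sigma_i))
```

Each point receives its own sigma, so points in dense regions end up with narrower Gaussians than points in sparse regions.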


3.9 Uniform Manifold Approximation and Projection 


Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can be used for visualization similarly to t-SNE [27]. It achieves this by leveraging concepts from algebraic topology and Riemannian geometry. A simplified breakdown of how UMAP works is as follows:
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1. Constructing a Topological Representation: UMAP starts by creating a fuzzy 
topological representation of the data. This involves building a simplicial com- 
plex, which is a way to represent topological spaces using simple geometric 
shapes called simplices. The algorithm constructs these simplices based on the 


proximity of data points. 


2. Optimizing Low-Dimensional Representation: Once the topological represen- 
tation is established, UMAP optimizes a low-dimensional representation of 
the data to match this topological structure as closely as possible. It does 
this by minimizing a measure called cross-entropy, which quantifies the dif- 
ference between the fuzzy topological structures of the high-dimensional and 


low-dimensional data. 


3. Efficient Computations: UMAP employs several strategies to make computa- 
tions efficient. It focuses on computing only the nearest neighbours of each 
point and uses algorithms like Nearest-Neighbour-Descent for this purpose. Ad- 
ditionally, it utilizes stochastic gradient descent for optimization and smooth 


approximations of the membership strength function to ensure differentiability. 


4. Preserving Topological Structure: The goal of UMAP is to ensure that the low- 
dimensional representation maintains the essential topological properties of the 
original data. It achieves this by balancing attractive forces that pull similar 
points together and repulsive forces that push dissimilar points apart based on 
the weights of edges in the topological representation. The farther apart two points are, the more dissimilar they are considered to be.


Overall, UMAP's combination of computational efficiency, scalability, flexibility in parameter tuning, and interpretability makes it an appealing alternative to t-SNE.
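The cross-entropy objective from step 2 can be sketched for a handful of edges. The membership function below uses the smooth form 1 / (1 + a d^(2b)); the values a = b = 1 and the example edge weights are assumptions for illustration, not the settings used in this work.

```python
import numpy as np

def low_dim_membership(dist, a=1.0, b=1.0):
    """Smooth, differentiable membership strength 1 / (1 + a * d^(2b))."""
    return 1.0 / (1.0 + a * dist ** (2.0 * b))

def fuzzy_cross_entropy(v, w, eps=1e-12):
    """Cross-entropy between high-dimensional (v) and low-dimensional (w) edge weights."""
    v = np.clip(v, eps, 1.0 - eps)
    w = np.clip(w, eps, 1.0 - eps)
    attract = v * np.log(v / w)                        # penalizes separating strong edges
    repel = (1.0 - v) * np.log((1.0 - v) / (1.0 - w))  # penalizes collapsing weak edges
    return float(np.sum(attract + repel))

# Hypothetical high-dimensional edge weights for three pairs of points.
v = np.array([0.9, 0.8, 0.05])
# Low-dimensional distances for a layout that respects the weights and one that does not.
good_layout = low_dim_membership(np.array([0.3, 0.5, 4.0]))
bad_layout = low_dim_membership(np.array([4.0, 4.0, 0.3]))
loss_good = fuzzy_cross_entropy(v, good_layout)
loss_bad = fuzzy_cross_entropy(v, bad_layout)
```

The layout that keeps strongly connected points close and weakly connected points far apart yields the lower loss, which is the direction stochastic gradient descent moves the embedding.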


3.10 Results and Discussion 


The K-Means clustering algorithm was used to categorize the entire set of experimen- 
tal voltammetry data after encoding. With K-Means, a value of K will need to be 
selected. This is done using the elbow method. Figure 3.5 shows the results of the 
elbow method applied to the dataset. It can be seen that this methodology identi- 
fies multiple potential candidates for K-values, necessitating a more comprehensive 
analysis to select the most appropriate option. The Silhouette method is used to analyze promising values to aid the decision-making process. A Silhouette score of one means points are perfectly assigned to a cluster and clusters are easily distinguishable. Zero means clusters overlap; negative one means points are assigned to the wrong cluster. The K value should be chosen based on which value produces the most clusters with Silhouette scores greater than the average score of the dataset, represented by the red dotted line in Figure 3.6a and Figure 3.6b. Furthermore, there should not be wide fluctuations in the size of the clusters. The width of each cluster in the plot represents the number of data points belonging to it. In Figure 3.6a, which shows the Silhouette method applied to the CV data, K = 38 yields the highest number of clusters with a score surpassing the mean of the dataset. This configuration also reduces the number of clusters scoring below zero and minimizes the variance in cluster sizes. Similarly, in Figure 3.6b, which shows the Silhouette method applied to the DPV data, K = 42 results in the best cluster quality. A subset of the cluster results is available in the appendix. Despite there being 100 different combinations of metals and ligands, using a relatively small K value still shows promising results, as the data points within each cluster have similar overall shapes, which is crucial for compound identification.


Figure 3.5: K-Means Elbow Method
Plots show no obvious choice for K.
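For reference, the Silhouette score of a single point is (b - a) / max(a, b), where a is the mean distance to the other members of its own cluster and b is the mean distance to the nearest other cluster. A minimal sketch, assuming at least two clusters and using synthetic data rather than the voltammetry encodings:

```python
import numpy as np

def silhouette_samples(points, labels):
    """Per-point silhouette s = (b - a) / max(a, b); requires at least two clusters."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    clusters = np.unique(labels)
    scores = np.zeros(len(points))
    for i in range(len(points)):
        own = labels == labels[i]
        if own.sum() == 1:
            continue  # singleton clusters are scored 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = dists[i, own].sum() / (own.sum() - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(dists[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

# Two well-separated synthetic groups, scored with correct and with shuffled labels.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0.0, 0.1, (30, 2)), rng.normal(5.0, 0.1, (30, 2))])
true_labels = np.repeat([0, 1], 30)
shuffled_labels = rng.permutation(true_labels)
score_good = silhouette_samples(points, true_labels).mean()
score_bad = silhouette_samples(points, shuffled_labels).mean()
```

Averaging the per-point scores over the whole dataset gives the reference line against which individual clusters are compared when choosing K.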


DBSCAN, as an alternative clustering method, demonstrated significant promise in identifying anomalous data points. With the appropriate parameters, DBSCAN efficiently grouped cycles from the same experiment. As Figure 3.7 depicts, DBSCAN defines clusters comprising cycles solely from a single experiment. This capability could be incorporated into SDLs as an error validation mechanism. Any cycle not assigned to the same cluster as others from the identical experiment could trigger an error notification, prompting intervention and investigation. Another method is identifying the erroneous clusters and investigating the assigned data points, as seen in Figure 3.8. To further demonstrate the efficacy of the encoding, t-SNE and UMAP projections are created to visualize the data in 2-D and show how the shapes, metals, and ligands are distributed.


Figure 3.6: Silhouette Method for Promising K Values
(b) DPV Silhouette Method


Figure 3.7: DBSCAN Clusters
(d) Cluster 4
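The error validation idea described above could be sketched as a simple check on the cluster labels. The function name, the experiment identifiers, and the majority-vote rule are hypothetical choices for illustration, not part of the existing SDL software.

```python
from collections import Counter, defaultdict

def flag_inconsistent_cycles(experiment_ids, cluster_labels):
    """Return indices of cycles that disagree with their experiment's majority cluster."""
    by_experiment = defaultdict(list)
    for index, (exp_id, label) in enumerate(zip(experiment_ids, cluster_labels)):
        by_experiment[exp_id].append((index, label))

    flagged = []
    for cycles in by_experiment.values():
        majority_label, _ = Counter(label for _, label in cycles).most_common(1)[0]
        for index, label in cycles:
            # -1 is the DBSCAN noise label and is always treated as suspicious.
            if label == -1 or label != majority_label:
                flagged.append(index)
    return sorted(flagged)

# Hypothetical labels for two experiments with four cycles each.
experiment_ids = ["exp_a"] * 4 + ["exp_b"] * 4
cluster_labels = [0, 0, 0, 2, 1, 1, -1, 1]
flagged = flag_inconsistent_cycles(experiment_ids, cluster_labels)
```

In this example the fourth cycle of the first experiment and the third cycle of the second would be reported for investigation.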


As seen in Figure 3.9 and Figure 3.10, t-SNE emphasizes local structure and tends to agglomerate similar data points into tight clusters. As a result, t-SNE plots often show more apparent separation between clusters but may not preserve the global structure as effectively, especially for complex datasets. t-SNE embeddings can also vary significantly with different random initializations and parameter choices, making the method less stable and potentially more sensitive to noise in the data.


Figure 3.8: UMAP Visualization of Normal CV Data and Incorrectly Shaped Data
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Figure 3.11: Cyclic Voltammetry UMAP Projection 
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Figure 3.12: Differential Pulse Voltammetry UMAP Projection 
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UMAP tends to focus more on preserving global structure and maintaining relative 
distance between clusters. Therefore, clusters in the UMAP plot are usually well separated and evenly distributed, and both local and global neighbourhoods are better represented. UMAP embeddings are also generally more stable across different runs and parameter settings than t-SNE embeddings. Figures 3.11 and 3.12 illustrate these characteristics, especially when contrasted with the projections generated by t-SNE. For example, Figure 3.11 shows more evenly spaced clusters than Figure 3.9. Interactive plots made with Bokeh, as seen in Figure 3.13, are available on GitHub. The corresponding voltammetry plot is shown when hovering over a point with the mouse. Pan, zoom, and rotation tools are also available. These plots are useful for chemists, and SDLs can generate them automatically.


Using machine learning techniques to classify voltammetry data based on overall shape offers several benefits over merely employing a script to count peaks, as has traditionally been done. Machine learning models can be trained to recognize patterns and variations in both the overall shape and the number of peaks. They can adapt to experimental conditions, electrode materials, and analytes without manual adjustment of parameters. Voltammetry data can often be noisy, especially at low concentrations, and ML models can be trained to distinguish true peaks from noise more effectively than simple peak-finding algorithms. Voltammograms can also vary in their characteristics due to electrode deterioration, surface roughness, and solution composition. ML models can learn to handle this variability and provide more reliable peak classification across different experimental conditions. Additionally, ML models can learn to detect when the electrode has deteriorated, so that an SDL can automatically polish or replace it. ML models can automatically extract relevant features from voltammogram data, such as peak heights, peak widths, peak potentials, and overall shape, which allows for more comprehensive analysis beyond locating peaks. Once trained, ML models can be integrated into larger data analysis pipelines to classify cyclic voltammetry data rapidly, potentially saving time and effort compared to manual analysis or parameter tuning for peak-finding algorithms. These automated analysis techniques can be integrated into an SDL and updated with newly generated data to increase accuracy as more experiments are performed.


In summary, clustering techniques are crucial in analyzing and interpreting experimental voltammetry results obtained from SDLs. By organizing data into meaningful clusters, clustering techniques like K-Means and DBSCAN, together with dimensionality reduction techniques like t-SNE and UMAP, uncover patterns, similarities, and trends that enhance our understanding of the electrochemical compounds of interest. The choice of appropriate clustering algorithms and parameter selection methods, such as the Elbow and Silhouette methods, has been discussed to ensure meaningful and reliable clustering results. The results obtained from clustering algorithms and dimensionality reduction techniques have provided valuable insights into the underlying structure of the experimental data, facilitating compound identification, error detection, and decision-making processes in SDLs.


Figure 3.9: Cyclic Voltammetry t-SNE Projection

Figure 3.10: Differential Pulse Voltammetry t-SNE Projection

Figure 3.13: Bokeh Interactive Plot with CV Data
UMAP projection with CV data


Chapter 4


Classification


4.1 Introduction 


To further demonstrate the feasibility of using this encoding technique for various machine learning tasks, a classifier was trained to predict which metals and ligands were used to generate each voltammogram. Although the ligand and metal labels are already known, promising results demonstrate that the encoding technique effectively captures the chemistry behind the measurements. The exact definitions of each metal and ligand are available in Appendix 6.1. It is important to note that the dataset used is relatively small for a deep learning task. A general rule of thumb is that there should be at least 10 times as many data points as features [28]. This rule is not satisfied, as the dataset consists of 800 CV and 200 DPV samples, with 1200 and 450 dimensions, respectively. The dataset was split into 80% for training, 10% for validation, and 10% for testing. 5-fold cross-validation is also used [29]; this technique assesses the performance of a machine learning model by dividing the dataset into k subsets, training the model on k-1 subsets, and evaluating it on the remaining subset, repeating this for each subset. An important insight to consider is the similarity between voltammetry data and images. After all, each point has a potential and current value, similar to an image's RGB values. The main difference is that an image is 2-dimensional while voltammetry data is 1-dimensional. Previous works have used convolutional neural networks (CNNs) for classification tasks [30]. Using this as inspiration, the proposed model architecture for voltammetry data classification uses 1-dimensional convolutional layers.
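The k-fold procedure described above can be sketched with index arrays alone. The function name and the fixed seed are assumptions; the sample count of 800 matches the CV dataset size stated above.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is held out exactly once."""
    order = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(order, k)
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != held_out])
        yield train_idx, test_idx

# Five folds over 800 samples: 640 for training and 160 for evaluation per fold.
splits = list(k_fold_indices(n_samples=800, k=5))
```

A model would be trained and evaluated once per pair, and the per-fold accuracies averaged.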


4.2 Variational Autoencoders 


Figure 4.1: Autoencoder Diagram 
Input is encoded into latent space and then recreated using decoder 


Since the dataset size is a significant challenge, creating synthetic data is one method 
to address this. A variational autoencoder (VAE) is similar to the autoencoder neural 
network architecture shown in Figure 4.1, with the main difference being that VAEs 
connect the encoder to its decoder through a probabilistic latent space corresponding 
to the parameters of a variational distribution [31]. The encoder maps each point 
from the dataset into a distribution within the latent space rather than a single point 
in that space. The distribution is typically Gaussian with a mean and a variance. 
Once the VAE is trained, different points can be sampled from the learned latent 


space distribution. These samples represent different configurations of the input data 
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in the latent space. The sampled points from the latent space are fed into the de- 
coder network, which reconstructs the input data corresponding to those points and 
generates a diverse set of synthetic data samples that resemble the original data dis- 
tribution. The variability in the latent space allows for the generation of novel and 


diverse data samples that capture the underlying characteristics of the training data. 
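Sampling from the Gaussian latent distribution is commonly done with the reparameterization trick. The sketch below assumes the encoder outputs a mean and a log-variance per latent dimension; the latent size and the numerical values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_latent(mean, log_var):
    """Reparameterization trick: z = mean + std * eps, with eps drawn from N(0, I)."""
    eps = rng.standard_normal(mean.shape)
    return mean + np.exp(0.5 * log_var) * eps

# Hypothetical encoder outputs for one voltammogram in a 4-dimensional latent space.
mean = np.array([0.5, -1.0, 0.0, 2.0])
log_var = np.array([-2.0, -2.0, -2.0, -2.0])

# Distinct latent samples for the same input; passing each through the decoder
# would produce a slightly different synthetic voltammogram.
samples = np.stack([sample_latent(mean, log_var) for _ in range(1000)])
```

Because the randomness enters only through `eps`, gradients can flow through `mean` and `log_var` during training.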


4.3 Conditional Variational Autoencoders 


While traditional VAEs learn a latent space for the dataset, conditional variational 
autoencoders (CVAEs) expand this concept by introducing conditional dependen- 
cies between the input data and the latent variables. In the context of generating 
synthetic data, CVAEs offer a more controlled approach by allowing the generation 
process to be conditioned on additional information, such as class labels or other at- 
tributes associated with the data. By conditioning the generation process on known 
attributes or labels, CVAEs can generate synthetic data samples that capture the 
underlying data distribution and adhere to specific conditions or constraints defined 
by the conditioning variables. This enables the targeted generation of synthetic data 
for different classes or categories when labelled data is lacking. In this case, the metal 
and ligand are encoded using one-hot encoding and passed to the decoder to generate 


data from the same class. 
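The conditioning step can be sketched as a concatenation of the latent sample with the one-hot labels. The latent size of 8, the zero-based class indices, and the helper names are assumptions for illustration; the counts of ten metals and ten ligands follow from the dataset described earlier.

```python
import numpy as np

N_METALS, N_LIGANDS = 10, 10

def one_hot(index, size):
    """Vector of zeros with a single one at the given zero-based index."""
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec

def decoder_input(z, metal, ligand):
    """Concatenate a latent sample with the one-hot metal and ligand labels."""
    return np.concatenate([z, one_hot(metal, N_METALS), one_hot(ligand, N_LIGANDS)])

rng = np.random.default_rng(0)
z = rng.standard_normal(8)               # latent size of 8 is an arbitrary choice
x = decoder_input(z, metal=2, ligand=7)  # zero-based class indices
```

Keeping the labels fixed while resampling `z` yields different synthetic samples from the same metal and ligand class.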
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Figure 4.2: Classification Model Architecture 


4.4 Classifier Model Architecture 


The classifier architecture can be seen in Figure 4.2. The model consists of several 


convolutional and max-pooling layers to encode the data and reduce dimensions. All 
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layers except the output layer use the ReLU activation function [32]. The output layer 
is a dense layer with ten units, one for each metal/ligand, and a softmax activation 
function [33]. The Adam optimizer [34] and categorical cross-entropy loss are used to 
train the model. Additionally, the model uses L2 regularization and early stopping 
to prevent overfitting and ensure smooth convergence. The Glorot uniform initializer 
[35] is used for weight initialization to facilitate better gradient flow and prevent 


exploding gradients. 
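A forward pass through an architecture of this kind can be sketched in NumPy. The layer counts, kernel widths, channel sizes, and pooling sizes below are assumptions and do not reproduce Figure 4.2; only the ingredients (1-D convolution, ReLU, max pooling, Glorot uniform initialization, a ten-unit softmax output, and categorical cross-entropy) are taken from the description above. Treating a CV sample as a length-1200, single-channel signal is also an assumption.

```python
import numpy as np

def conv1d(x, kernels, bias):
    """Valid 1-D convolution. x: (length, c_in); kernels: (width, c_in, c_out)."""
    width = kernels.shape[0]
    windows = np.stack([x[i:i + width] for i in range(x.shape[0] - width + 1)])
    return np.einsum("lwc,wco->lo", windows, kernels) + bias

def relu(x):
    return np.maximum(x, 0.0)

def max_pool1d(x, size):
    """Non-overlapping max pooling along the length axis."""
    usable = (x.shape[0] // size) * size
    return x[:usable].reshape(-1, size, x.shape[1]).max(axis=1)

def softmax(logits):
    shifted = np.exp(logits - logits.max())
    return shifted / shifted.sum()

def glorot_uniform(rng, shape):
    """Uniform(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out))."""
    receptive = int(np.prod(shape[:-2]))  # 1 for dense layers
    fan_in, fan_out = shape[-2] * receptive, shape[-1] * receptive
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=shape)

rng = np.random.default_rng(0)
signal = rng.standard_normal((1200, 1))  # stand-in for one encoded CV sample
k1 = glorot_uniform(rng, (5, 1, 8))
k2 = glorot_uniform(rng, (5, 8, 16))
h = max_pool1d(relu(conv1d(signal, k1, np.zeros(8))), 4)
h = max_pool1d(relu(conv1d(h, k2, np.zeros(16))), 4)
w_out = glorot_uniform(rng, (h.size, 10))
probs = softmax(h.reshape(-1) @ w_out)   # one probability per metal or ligand
true_class = 3
loss = -np.log(probs[true_class])        # categorical cross-entropy for one sample
```

Each convolution and pooling stage shortens the signal, which is how the network reduces the dimensionality before the dense output layer.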


4.5 Results and Discussion 


Model       | Fold 1 Acc | Fold 2 Acc | Fold 3 Acc | Fold 4 Acc | Fold 5 Acc | Avg Acc
CV Ligands  | 75.13%     | 80.29%     | 69.65%     | 76.40%     | 78.82%     | 76.06%
CV Metals   | 79.24%     | 82.50%     | 81.36%     | 80.11%     | 78.97%     | 80.44%
DPV Ligands | 30.00%     | 35.00%     | 25.00%     | 25.00%     | 30.00%     | 29.00%
DPV Metals  | 25.00%     | 30.00%     | 25.00%     | 20.00%     | 25.00%     | 25.00%


Table 4.1: Classification Accuracy


Separate classifiers were trained, each with the task of classifying CV ligands, CV metals, DPV ligands, or DPV metals. The accuracy of the classifiers can be seen in Table 4.1; the results were much better for the CV data than for the DPV data. This difference can likely be attributed to the size of the datasets. After incorporating synthetic data generated with the CVAE into the training process, average accuracy improved for the CV data, from 76.06% to 76.47% for ligands and from 80.44% to 82.45% for metals, as seen in Table 4.2. However, both DPV classifiers saw a decrease in performance, with the ligand classifier dropping from 29.00% to 21.00% and the metal classifier from 25.00% to 23.00%. Again, this is likely due to the size of the dataset.


Model       | Fold 1 Acc | Fold 2 Acc | Fold 3 Acc | Fold 4 Acc | Fold 5 Acc | Avg Acc
CV Ligands  | 77.86%     | 78.50%     | 80.94%     | 72.01%     | 73.02%     | 76.47%
CV Metals   | 85.00%     | 81.23%     | 84.19%     | 83.10%     | 78.75%     | 82.45%
DPV Ligands | 25.00%     | 25.00%     | 15.00%     | 20.00%     | 20.00%     | 21.00%
DPV Metals  | 25.00%     | 20.00%     | 20.00%     | 25.00%     | 25.00%     | 23.00%


Table 4.2: Classification Accuracy with Synthetic Data


Several key considerations impact the quality of data generated by VAEs, especially when dealing with small datasets. Firstly, the quality and diversity of the original data influence the effectiveness of the synthetic data produced by VAEs. With limited variation or complexity in a small dataset, the VAE might struggle to accurately capture the true underlying data distribution, potentially resulting in synthetic data that fails to fully represent the characteristics of the actual data. This mismatch can detrimentally affect classifier performance.


Additionally, the risk of overfitting is heightened in small datasets, where the classifier may excessively specialize in training data patterns that do not generalize well. Introducing synthetic data from a VAE can compound this issue if the VAE overfits the small dataset, producing synthetic data that is overly similar to the training data; such data provides minimal additional information for the classifier and can lead to decreased performance on unseen data. VAEs implicitly learn the probability distribution of the input data. However, if the distribution learned by the VAE differs significantly from the actual data distribution because of the small dataset size, the synthetic data generated by the VAE may not accurately represent the true data distribution. This distribution mismatch can confuse the classifier, as it may encounter data points in the synthetic dataset that deviate from the real data distribution, leading to suboptimal performance.


             | Precision | Recall | F1-Score | Support
Metal 1      | 0.88      | 0.88   | 0.88     | 8
Metal 2      | 0.80      | 1.00   | 0.89     | 8
Metal 3      | 1.00      | 1.00   | 1.00     | 4
Metal 4      | 1.00      | 0.83   | 0.91     | 12
Metal 5      | 1.00      | 0.71   | 0.83     | 7
Metal 6      | 0.88      | 0.78   | 0.82     | 9
Metal 7      | 0.82      | 0.90   | 0.86     | 10
Metal 8      | 0.50      | 0.40   | 0.44     | 5
Metal 9      | 0.78      | 1.00   | 0.88     | 7
Metal 10     | 0.82      | 0.90   | 0.86     | 10
Accuracy     |           |        | 0.85     | 80
Macro Avg    | 0.85      | 0.84   | 0.84     | 80
Weighted Avg | 0.86      | 0.85   | 0.85     | 80


Table 4.3: CV Metals Classification Report


Table 4.3 provides insights into the precision, recall, and F1-score when classifying each CV metal type, together with the number of instances (support) for each metal type. Precision indicates the proportion of true positive predictions among all positive predictions, while recall measures the proportion of actual positives that were correctly identified. The F1-score, the harmonic mean of precision and recall, provides a balanced measure between the two. Overall, the classifier achieved an accuracy of 85%, indicating its effectiveness in classifying different metal types. However, it is important to note variations in performance across metal types. For instance, metal 3 achieved perfect precision, recall, and F1-score, suggesting the model's ability to accurately classify this particular metal type. On the other hand, metal 8 exhibited lower precision and recall scores (0.50 and 0.40), indicating potential challenges in distinguishing this metal type from others. Both macro-average and weighted-average metrics hover around 0.85, indicating a reasonably balanced performance across all metal types. These metrics consider the average performance across all classes, with the macro-average treating all classes equally, while the weighted-average weights the contribution of each class by its support.


             | Precision | Recall | F1-Score | Support
Ligand 1     | 0.88      | 0.78   | 0.82     | 9
Ligand 2     | 0.88      | 0.88   | 0.88     | 8
Ligand 3     | 0.75      | 0.86   | 0.80     | 7
Ligand 4     | 0.45      | 0.71   | 0.56     | 7
Ligand 5     | 0.78      | 0.70   | 0.74     | 10
Ligand 6     | 1.00      | 0.86   | 0.92     | 7
Ligand 7     | 0.71      | 0.50   | 0.59     | 10
Ligand 8     | 0.67      | 0.89   | 0.76     | 9
Ligand 9     | 1.00      | 0.67   | 0.80     | 3
Ligand 10    | 1.00      | 0.90   | 0.95     | 10
Accuracy     |           |        | 0.78     | 80
Macro Avg    | 0.81      | 0.77   | 0.78     | 80
Weighted Avg | 0.80      | 0.78   | 0.78     | 80


Table 4.4: CV Ligands Classification Report


Table 4.4 shows the classification report for CV ligands. The classifier achieved an accuracy of 78% overall, indicating its capability to classify different ligand types to some extent. However, there are notable variations in performance across ligand types. For instance, ligand 6 demonstrates excellent precision, recall, and F1-score, suggesting the model's proficiency in accurately classifying this ligand type. Conversely, ligand 4 exhibits lower precision, recall, and F1-score, indicating challenges in distinguishing this ligand type from others.


             | Precision | Recall | F1-Score | Support
Ligand 1     | 1.00      | 1.00   | 1.00     | 1
Ligand 2     | 0.00      | 0.00   | 0.00     | 2
Ligand 3     | 0.33      | 0.25   | 0.29     | 4
Ligand 4     | 0.33      | 0.33   | 0.33     | 3
Ligand 5     | 0.50      | 0.50   | 0.50     | 4
Ligand 6     | 0.00      | 0.00   | 0.00     | 1
Ligand 7     | 0.00      | 0.00   | 0.00     | 1
Ligand 8     | 0.00      | 0.00   | 0.00     | 0
Ligand 9     | 0.00      | 0.00   | 0.00     | 1
Ligand 10    | 1.00      | 0.33   | 0.50     | 3
Accuracy     |           |        | 0.30     | 20
Macro Avg    | 0.32      | 0.24   | 0.26     | 20
Weighted Avg | 0.42      | 0.30   | 0.33     | 20


Table 4.5: DPV Ligands Classification Report


             | Precision | Recall | F1-Score | Support
Metal 1      | 0.33      | 1.00   | 0.50     | 2
Metal 2      | 0.00      | 0.00   | 0.00     | 1
Metal 3      | 0.25      | 0.50   | 0.33     | 2
Metal 4      | 0.00      | 0.00   | 0.00     | 3
Metal 5      | 0.00      | 0.00   | 0.00     | 1
Metal 6      | 0.50      | 0.25   | 0.33     | 4
Metal 7      | 0.00      | 0.00   | 0.00     | 2
Metal 8      | 1.00      | 0.50   | 0.67     | 2
Metal 9      | 0.00      | 0.00   | 0.00     | 3
Metal 10     | 0.00      | 0.00   | 0.00     | 0
Accuracy     |           |        | 0.30     | 20
Macro Avg    | 0.23      | 0.25   | 0.20     | 20
Weighted Avg | 0.26      | 0.25   | 0.22     | 20


Table 4.6: DPV Metals Classification Report


Table 4.5 and Table 4.6 show the classification reports for DPV ligands and metals. However, the small sample size makes it difficult to draw definitive conclusions from this data.

To further assess the performance of these classification models, receiver operating characteristic (ROC) curves and area under the ROC curve (AUC) values can be used to gain insight into the discrimination capabilities and robustness of the models when distinguishing between various metals and ligands. ROC curves and AUC values help assess the robustness of the classification model by showing how well it performs across different decision thresholds. A smooth ROC curve with a high AUC suggests that the model can reliably discriminate between different metals and ligands even in the presence of noise or variability.
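The quantities reported in the classification reports can all be derived from a confusion matrix. The sketch below uses a hypothetical 3-class matrix, not the thesis results, and the function name is an assumption.

```python
import numpy as np

def classification_report(confusion):
    """Per-class precision, recall, F1 and support from a confusion matrix.

    confusion[i, j] counts samples of true class i predicted as class j.
    """
    confusion = np.asarray(confusion, dtype=float)
    true_pos = np.diag(confusion)
    predicted = confusion.sum(axis=0)  # column sums: all predictions of a class
    support = confusion.sum(axis=1)    # row sums: all true members of a class
    with np.errstate(divide="ignore", invalid="ignore"):
        precision = np.where(predicted > 0, true_pos / predicted, 0.0)
        recall = np.where(support > 0, true_pos / support, 0.0)
        denom = precision + recall
        f1 = np.where(denom > 0, 2 * precision * recall / denom, 0.0)
    return precision, recall, f1, support

# Hypothetical 3-class confusion matrix (not taken from the results above).
confusion = np.array([[8, 1, 1],
                      [2, 6, 2],
                      [0, 0, 5]])
precision, recall, f1, support = classification_report(confusion)
accuracy = np.trace(confusion) / confusion.sum()
macro_f1 = f1.mean()                                   # every class counts equally
weighted_f1 = np.average(f1, weights=support)          # classes weighted by support
weighted_recall = np.average(recall, weights=support)  # equals the accuracy
```

The support-weighted average of recall always equals the overall accuracy, which provides a quick consistency check on a report.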


Furthermore, there may be a need to choose a classification threshold that balances sensitivity and specificity according to specific requirements or constraints. ROC curves provide a visual aid for selecting an appropriate threshold based on the desired trade-off between true positives and false positives. For example, when integrating with an SDL, minimizing false positives (misclassification of metals or ligands) might be prioritized over maximizing true positives. The ROC curves in Figure 4.3 and Figure 4.4 show promising results for ligands and metals. The AUC summarizes the ROC curve analysis into a scalar value, which ranges between 0 and 1. The closer the AUC is to 1, the better the classifier's overall performance. In Figure 4.5 and Figure 4.6, the ROC curves show that the classifier outperforms a random classifier by having an AUC value above 0.5. The data itself may cause issues with classification, as some ligands and metals may be more difficult to distinguish than others. The confusion matrices are provided to investigate this. Figure 4.7 shows the confusion matrix for CV ligands. Ligand 7 was often misclassified as ligand 6. However, this misclassification is understandable: Figure 4.8 shows that the voltammograms for ligand 6 and ligand 7 are quite similar. Figure 4.9 shows the confusion matrix for CV metals. Metal 1 was difficult to recognize, with many metals being misclassified as metal 1. From the DPV confusion matrices seen in Figure 4.10 and Figure 4.11, it is hard to draw any definitive conclusions due to the dataset size.


Figure 4.3: CV Ligand ROC Curves
Micro-average ROC curve (area = 0.97)

Figure 4.4: CV Metal ROC Curves
Micro-average ROC curve (area = 0.95); ROC curve areas for metals 1 to 10: 0.89, 1.00, 0.96, 0.98, 0.97, 0.96, 0.90, 0.88, 0.95, 1.00. Horizontal axis: False Positive Rate.

Figure 4.5: DPV Ligand ROC Curves

Figure 4.6: DPV Metal ROC Curves
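An ROC curve and its AUC for a single class can be computed by sweeping the decision threshold over the predicted scores. The sketch below treats one class against the rest, assumes all scores are distinct, and uses hypothetical labels and scores.

```python
import numpy as np

def roc_curve(y_true, scores):
    """False and true positive rates as the decision threshold is lowered."""
    order = np.argsort(-scores, kind="stable")
    y_sorted = y_true[order]
    tps = np.cumsum(y_sorted)      # true positives accepted so far
    fps = np.cumsum(1 - y_sorted)  # false positives accepted so far
    tpr = np.concatenate([[0.0], tps / tps[-1]])
    fpr = np.concatenate([[0.0], fps / fps[-1]])
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Hypothetical one-vs-rest labels and scores for a single class.
y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])
scores = np.array([0.95, 0.85, 0.80, 0.70, 0.55, 0.40, 0.35, 0.10])
fpr, tpr = roc_curve(y_true, scores)
area = auc(fpr, tpr)
```

Here 12 of the 16 positive-negative pairs are ranked correctly, giving an AUC of 0.75; a random classifier would score about 0.5.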


A significant challenge in supervised learning is providing good examples during training. However, despite using a small dataset, these results are promising. This study establishes that crude CV data obtained from an economical potentiostat can be effectively encoded using CNNs. It was also shown that synthetic data generated by a CVAE can improve classification accuracy for the CV data, although the same benefit was not observed for the much smaller DPV dataset. These findings align with recent research demonstrating that deep learning models can efficiently process CV data [36]. Future research can incorporate group SELFIES [37] within the decoder layer to predict or select from a pool of candidate redox groups identified through voltammetry, or to predict the compound used.


Figure 4.7: CV Ligand Confusion Matrix

Figure 4.8: Ligand 6 and Ligand 7 Voltammogram Comparison

Figure 4.9: CV Metal Confusion Matrix

Figure 4.10: DPV Ligand Confusion Matrix

Figure 4.11: DPV Metal Confusion Matrix


Chapter 5


Denoising


5.1 Introduction 


As the previously discussed dataset was generated using a low-cost potentiostat that 
lacks the accuracy of commercial options [11], we attempt to improve the quality of 
the data obtained by this potentiostat by applying ML to denoise the raw data with 


the commercial potentiostat data as a reference. 


5.2 Autoencoder 


As shown in Figure 4.1, an autoencoder is a neural network used to learn an efficient 
low-dimensional encoding of data. An autoencoder consists of an encoder and a 
decoder. The encoder transforms the input data into an encoded representation, and 
the decoder attempts to recreate the data from the encoded representation. Since the goal is to improve the data quality, the commercial potentiostat data is used as the target of the decoder instead. This way, the low-cost potentiostat data is encoded, and the equivalent commercial potentiostat data is decoded from that representation. The main problem is how to pair results from the two potentiostats. While the metal and ligand used for each experiment are recorded, numerous other variables can influence the data. Therefore, the challenge revolves around accurately aligning the outcomes generated by the low-cost potentiostat with their counterparts from the commercial one. This alignment is crucial for ensuring the reliability and validity of the encoded representations created by the autoencoder. Without precise pairing, the encoded representations may not accurately capture the underlying patterns in the data, leading to suboptimal performance of the autoencoder. The clustering technique previously described can be employed to address this challenge. Similar experimental results can be grouped by leveraging the recorded information on metals, ligands, and other relevant variables. The clustering process helps identify pairs of results with comparable characteristics despite potential variations introduced by the different potentiostats.


The architecture used to denoise the data is similar to the classifier architecture shown 
in Figure 4.2. The main difference is that the output layer uses the Sigmoid activation 


function instead [38]. 
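The training setup, with noisy inputs and clean targets, can be sketched with a small dense autoencoder. This is a deliberate simplification: the model in this work uses convolutional layers, and the synthetic sine curves, layer sizes, learning rate, and number of steps below are all assumptions for illustration. The sigmoid output implies that the targets are scaled to the range 0 to 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Synthetic stand-ins: smooth reference curves in [0, 1] and noisy copies of them.
t = np.linspace(0.0, 1.0, 64)
phases = rng.uniform(0.0, 2.0 * np.pi, size=(200, 1))
clean = 0.5 + 0.4 * np.sin(2.0 * np.pi * t + phases)   # targets (reference device)
noisy = clean + rng.normal(0.0, 0.1, size=clean.shape)  # inputs (low-cost device)

# Encoder (64 -> 8) and decoder (8 -> 64) parameters.
w_enc = rng.normal(0.0, 0.1, size=(64, 8))
w_dec = rng.normal(0.0, 0.1, size=(8, 64))
b_enc, b_dec = np.zeros(8), np.zeros(64)

def forward(x):
    code = np.tanh(x @ w_enc + b_enc)
    return code, sigmoid(code @ w_dec + b_dec)

def mse(x, target):
    return float(np.mean((forward(x)[1] - target) ** 2))

loss_before = mse(noisy, clean)
lr = 0.2
for _ in range(3000):
    code, out = forward(noisy)
    # Backpropagate the mean squared error against the clean targets.
    grad_out = 2.0 * (out - clean) / clean.size * out * (1.0 - out)
    grad_code = grad_out @ w_dec.T * (1.0 - code ** 2)
    w_dec -= lr * code.T @ grad_out
    b_dec -= lr * grad_out.sum(axis=0)
    w_enc -= lr * noisy.T @ grad_code
    b_enc -= lr * grad_code.sum(axis=0)
loss_after = mse(noisy, clean)
```

The only difference from a standard autoencoder is that the loss compares the output with the paired reference measurement rather than with the input itself.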


5.3 Results and Discussion 


As seen in Figure 5.1, both the input and output are similar in overall shape. How- 
ever, the output contains a much more defined duck-shaped voltammogram, which 
is typically expected. The results show promising outcomes and indicate that an au- 
toencoder can effectively transform data from the low-cost potentiostat to resemble 
data from the commercial potentiostat. By leveraging the capacity of deep neural 
networks to learn complex patterns and relationships within the data, it becomes 


feasible to enhance the quality of measurements obtained from low-cost instruments, thereby expanding their utility in research and industrial applications. However, despite the promising results, several drawbacks and considerations must be acknowledged. Firstly, the effectiveness of the transformation relies heavily on the quality and diversity of the training data. Insufficient or biased training samples may lead to suboptimal performance and generalization issues, especially when dealing with complex electrochemical processes or diverse experimental conditions. Secondly, while the autoencoder can effectively capture and replicate the dominant features present in the data, it may struggle to preserve subtle nuances or domain-specific characteristics inherent to the commercial potentiostat. Variations in hardware specifics, measurement protocols, or environmental factors could introduce discrepancies between the transformed and reference datasets. In conclusion, while autoencoders offer a promising avenue for enhancing the capabilities of low-cost potentiostats, their deployment must be accompanied by rigorous validation and consideration of the aforementioned limitations. Future research could focus on optimizing the autoencoder architecture, exploring alternative deep learning techniques, and investigating strategies for addressing data heterogeneity to further improve the robustness and versatility of the proposed approach.


Figure 5.1: Autoencoder Results


Chapter 6


Conclusion


In summary, the novel technique introduced for encoding CV and DPV data represents
an important advance for SDLs. By segmenting voltammograms according to their
distinct characteristics and proving effective across a range of machine learning
applications, from clustering and classification to denoising and synthetic data
generation, this technique marks a significant step toward improving the automation
of custom low-cost devices in SDLs. Machine learning models that can precisely encode
chemical data from characterization results may be used to enhance high-throughput
operations by integrating multiple low-cost devices that share the same trained
model. This approach streamlines the adoption of SDL and high-throughput setups and
facilitates their integration into diverse research endeavours.


Looking ahead, there is ample room for further exploration, particularly in
investigating alternative curve simplification algorithms and integrating the
encoding technique into operational SDL frameworks. Doing so promises to improve the
efficiency and accuracy of SDL setups and could broaden access to such technologies.
By substantially lowering the entry barriers for new research groups interested in
SDL and high-throughput setups, this advancement opens the door to a more inclusive
and collaborative scientific landscape and may inspire future research.
6.1 Data and Code Availability


All the relevant code can be found on GitHub. The generated database containing
the raw and processed CV and DPV measurement results can be found in a Zenodo
data repository (DOI: 10.5281/zenodo.10633135) [11].
No.  Formula            CAS
1    VOSO4·xH2O         123334-20-3
2    CrK(SO4)2·12H2O    7788-99-0
3    MnSO4·4H2O         10034-96-5
4    FeSO4·7H2O         7782-63-0
5    CoSO4·7H2O         10026-24-1
6    NiSO4·6H2O         10101-97-0
7    CuSO4·5H2O         7758-99-8
8    ZnSO4·7H2O         7446-20-0
9    CdSO4·8/3H2O       7790-84-3
10   Na2PdCl4           13820-53-6


Table A.1: Table of Metals
Ligand                              SMILES               CAS
ammonia                             N                    1336-21-6
hydrazine                           NN                   7803-57-8
ethylenediamine                     NCCN                 107-15-3
ethanolamine                        NCCO                 141-43-5
diethanolamine                      OCCNCCO              111-42-4
triethanolamine                     OCCN(CCO)CCO         102-71-6
piperidine                          N1CCCCC1             110-89-4
morpholine                          N1CCOCC1             110-91-8
pyridine                            n1ccccc1             110-86-1
2,2’-bipyridine (in HCl salt form)  c1ccc(nc1)c2ccccn2   366-18-7


Table A.2: Table of Ligands

